
WIP: Merge Dev to Main#2846

Merged
danielaskdd merged 622 commits into main from dev on May 21, 2026

Conversation

@danielaskdd
Collaborator

Dummy PR: Merge Dev to Main (Never try to merge this PR)

danielaskdd and others added 28 commits May 9, 2026 19:58
…size resolution

- introduce `_apply_chunk_size_overlay` to reconcile `chunk_token_size` and `chunk_overlap_token_size` across config tiers
- change `chunk_token_size` and `chunk_overlap_token_size` fields to `Optional[int]` with `None` default
- update `default_chunker_config` to only read strategy-specific env vars, leaving slots empty for overlay fallback
- add precedence chain: addon_params explicit > strategy env > legacy constructor field > legacy env
- back-fill legacy instance fields after resolution for backward compatibility with downstream readers
- update Chinese documentation to reflect new configuration hierarchy and priority rules
- add comprehensive tests covering constructor overlay, addon_params precedence, strategy env wins, and legacy fallback
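The precedence chain in this commit can be sketched as a "first non-None wins" lookup. This is a minimal illustration, not the project's `_apply_chunk_size_overlay`: the function name, its parameters, and the way env values are parsed are assumptions; only the tier order and the `CHUNK_SIZE` / `CHUNK_P_SIZE` variable names come from the commit messages in this PR.

```python
import os
from typing import Mapping, Optional


def _env_int(env: Mapping[str, str], name: str) -> Optional[int]:
    """Return the env var as an int, or None when it is unset or blank."""
    raw = env.get(name, "").strip()
    return int(raw) if raw else None


def resolve_chunk_token_size(
    addon_value: Optional[int],
    strategy_env_name: str,
    constructor_value: Optional[int],
    env: Mapping[str, str] = os.environ,
    legacy_env_name: str = "CHUNK_SIZE",
) -> Optional[int]:
    """Hypothetical resolver: walk the four tiers from highest to lowest priority."""
    candidates = (
        addon_value,                       # 1. explicit addon_params value
        _env_int(env, strategy_env_name),  # 2. strategy-specific env var
        constructor_value,                 # 3. legacy constructor field
        _env_int(env, legacy_env_name),    # 4. legacy env var
    )
    # The first tier that actually supplied a value wins.
    return next((value for value in candidates if value is not None), None)
```

Using `None` as the "slot is empty" marker is what makes the `Optional[int]` field change matter: a tier that was never set cannot shadow a lower tier that was.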
…h semantic strategy

- introduce CHUNK_P_SIZE env variable to decouple P strategy chunk size from global CHUNK_SIZE
- update default_chunker_config to parse and inject CHUNK_P_SIZE into paragraph_semantic options
- modify pipeline to extract and apply per-strategy chunk_token_size for P strategy with fallback to resolved top-level size
- document new env variable and configuration in Chinese docs with usage guidance
- add tests verifying env override behavior and fallback to global chunk size when unset
- add upper version bounds for langchain-text-splitters (<2) and langchain-experimental (<1)
- remove duplicate langchain 1.x and langchain-core 1.x entries from uv.lock
- add missing explicit dependencies (defusedxml, langchain-experimental, langchain-text-splitters) to api/evaluation/offline/test extras
- pin async-timeout to 4.0.3 for python < 3.11 to resolve version conflicts
…e file processing documentation

- reorganize document with numbered sections for server deployment workflow
- add quick start section with legacy, native, and combined configuration examples
- introduce detailed chunk_options configuration with environment variable reference
- add new chapter for python sdk usage covering runtime api and deprecated parameters
- improve clarity on engine fallback, validation, and priority chains
- relocate and expand storage layout, duplicate detection, concurrency, and resume rules sections
- add appendix for upgrade notes regarding deprecated multimodal global switch
…n params

- ensure chunk size configuration is reconciled when runtime addon params are set
- maintain consistency across all four configuration tiers
feat(chunker): add R/V chunkers and chunk_options snapshot mechanism
- move extraction-related settings below multimodal parsing section
- uncomment CHUNK_P_SIZE to set default value of 3000
- improve logical grouping by placing docling settings before extraction configs
…rategies

- introduce CHUNK_R_SIZE env variable for recursive character chunker
- introduce CHUNK_V_SIZE env variable for semantic vector chunker
- update env.example with new per-strategy size options and documentation
- modify pipeline to pop and apply strategy-specific chunk_token_size
- add tests for dedicated env override and fallback behavior for both R and V
- add _format_chunking_log helper to emit concise, scannable log lines
- alias long parameter keys to short forms for readability
- skip None and empty values to keep output compact
- log before each chunking strategy call (P, R, V, F, F(legacy))
- include chunk size, relevant params, and file path in every log line
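A rough sketch of what such a log formatter might look like follows. The function name, the output layout, and the `chunk_token_size` to `size` alias are assumptions made for illustration; the `break`, `buf`, and `regex` aliases are taken from later commits in this PR.

```python
from typing import Any, Mapping

# Alias table: long parameter keys mapped to short display names.
# "size" is a hypothetical alias; the other three appear in later commits.
_ALIASES = {
    "chunk_token_size": "size",
    "breakpoint_threshold_type": "break",
    "buffer_size": "buf",
    "sentence_split_regex": "regex",
}


def _is_empty(value: Any) -> bool:
    """True for None and for empty sized values ("" / [] / {}), but not for 0."""
    return value is None or (hasattr(value, "__len__") and len(value) == 0)


def format_chunking_log(strategy: str, params: Mapping[str, Any], file_path: str) -> str:
    """Build one compact log line for a chunking call (illustrative only)."""
    parts = [
        f"{_ALIASES.get(key, key)}={value}"
        for key, value in params.items()
        if not _is_empty(value)
    ]
    parts.append(f"file={file_path}")
    return f"chunking[{strategy}] " + " ".join(parts)
```

For example, `format_chunking_log("R", {"chunk_token_size": 1200, "separators": None}, "a.md")` yields `chunking[R] size=1200 file=a.md`.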
…ptions

- document `CHUNK_R_SIZE` and `CHUNK_V_SIZE` environment variables
- add strategy-specific size fields to recursive_character and semantic_vector examples
- update priority chain to include new R and V size env variables
- clarify R size favors smaller targets for sentence splitting and V size acts as advisory ceiling
… to doc metadata

- rename _format_chunking_log to _format_chunking_params for reuse in both logging and metadata
- add chunk_opts_str to capture and persist actual chunker params to doc_status.metadata
- include chunk_opts in _DOC_STATUS_METADATA_CARRY_OVER_KEYS for visibility across status transitions
- replace three separate metadata fields with single compact string
- keep same information in "pre -> post" format while reducing noise
- signal split occurrence by field presence alone
- P chunker: anchor-less branch falls back to recursive_character
  splitting so chunk_token_size is honored even when no eligible
  paragraph anchor is available (e.g. dense academic prose).
  Previously the block was emitted as a single oversized chunk and
  relied on the embedding-time hard fallback, which uses
  embedding_token_limit (not chunk_token_size) and cannot enforce
  the user-configured size.

- V chunker: extend default sentence_split_regex to recognize CJK
  sentence terminators (。?!) so SemanticChunker actually produces
  sentences on Chinese / mixed-language input instead of treating the
  whole document as one. Add post-split size enforcement via R for
  any piece exceeding chunk_token_size, since SemanticChunker has no
  native size cap.

- R chunker: extend default separators with CJK punctuation
  (。!?;,) so Chinese documents split at semantic boundaries
  instead of falling through to character-level splitting. English
  '.?!' intentionally excluded, because a literal match would split
  numerals (0.95) and abbreviations (e.g.).

- Expose CHUNK_V_SENTENCE_SPLIT_REGEX env var (alongside existing
  CHUNK_R_SEPARATORS) so users can customize per deployment.

- Move shared defaults (DEFAULT_R_SEPARATORS,
  DEFAULT_SENTENCE_SPLIT_REGEX) to constants.py as the single source
  of truth.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
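To illustrate the V chunker change, here is a minimal sketch of a CJK-aware sentence splitter. The pattern below is an assumption, not the value of `DEFAULT_SENTENCE_SPLIT_REGEX`: it splits after ASCII terminators only when whitespace follows, and directly after full-width CJK terminators, which are normally not followed by whitespace.

```python
import re
from typing import List

# Assumed pattern (not the project's actual default):
#   (?<=[.?!])\s+   ASCII terminator followed by whitespace
#   (?<=[。?!])    full-width CJK terminator, zero-width split point
SENTENCE_SPLIT_REGEX = r"(?<=[.?!])\s+|(?<=[。?!])"


def split_sentences(text: str, pattern: str = SENTENCE_SPLIT_REGEX) -> List[str]:
    """Split text into sentences, dropping empty pieces.

    Relies on re.split honoring zero-width matches (Python 3.7+).
    """
    return [piece.strip() for piece in re.split(pattern, text) if piece.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("你好。今天天气很好!Is it 0.95? Yes."):
        print(sentence)
```

Because the ASCII branch requires trailing whitespace, a numeral such as `0.95` is not split at its decimal point. A whitespace-only pattern, by contrast, would find no boundary in text like `你好。今天天气很好!` and would return it as a single sentence.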
…ment

Sync FileProcessingConfiguration-zh.md with the chunker fixes:

- §2.5 process options table: explain new R default cascade (CJK
  punctuation tier), V's CJK-aware sentence splitter and post-split
  R-based size enforcement, and P's anchor-less fallback to R.

- §3.2 env vars table: update CHUNK_R_SEPARATORS default, switch
  CHUNK_V_SIZE description from "advisory ceiling" to "hard cap",
  and document the new CHUNK_V_SENTENCE_SPLIT_REGEX env var.

- §3.4 chunk_options JSON example: reflect new R separators default
  and add semantic_vector.sentence_split_regex field.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- add "breakpoint" to "break" alias
- add "buffer" to "buf" alias
- add "sentence_split_regex" to "regex" alias
fix(chunker): CJK punctuation support and chunk_token_size enforcement
- remove redundant alias mappings for "breakpoint" and "buffer"
- shorten "breakpoint_threshold_type" alias from "breakpoint" to "break"
- shorten "buffer_size" alias from "buffer" to "buf"
- remove duplicate blocks_path from p_opts passed to _format_chunking_params
- prevent potential key collision since blocks_path is already extracted separately
…ic chunker

- reduce CHUNK_P_SIZE in env.example from 3000 to 2000 for consistency
- update chunk_token_size default in paragraph_semantic.py from 1200 to 2000
- merge AGENTS.md content into CLAUDE.md and remove duplicate file
- update project structure to reflect current module layout
- add workspace isolation details and pipeline concurrency contract
- include WebUI commands, testing scripts, and setup wizard outputs
- remove redundant sections and streamline common issues
- rename CLAUDE.md to AGENTS.md for generic AI agent usage
- replace full CLAUDE.md content with reference to AGENTS.md
- update .gitignore to use broader "AI Agent files" terminology
… fallback

- detect table format (json / html / unknown) via explicit format=
  attribute, fall back to body sniffing when attrs are silent
- split json tables on top-level row items and html tables on <tr>
  boundaries; only when no row boundary is available, or a single row
  alone exceeds the cap, drop to character-level fallback
- apply the same table-aware fallback in stage C anchor-driven
  long-block re-split so non-table residuals are character-split while
  oversized tables retain row integrity
- tests cover detect / html row extraction / json splitting / combined
  dispatcher; existing _expand_block_with_table_splits paths unchanged
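The row-first, characters-last fallback for HTML tables could look roughly like the sketch below. It is a simplified illustration under stated assumptions: character length stands in for token counting, the function name is hypothetical, and table wrapper re-attachment, format detection, and JSON handling are all omitted.

```python
import re
from typing import List


def _char_split(text: str, max_len: int) -> List[str]:
    """Last-resort fallback: cut text into fixed-size character windows."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


def split_html_table_rows(table_html: str, max_len: int) -> List[str]:
    """Pack whole <tr> rows into chunks of at most max_len characters."""
    rows = re.findall(r"<tr\b.*?</tr>", table_html, flags=re.IGNORECASE | re.DOTALL)
    if not rows:
        # No row boundary available: drop to character-level splitting.
        return _char_split(table_html, max_len)

    chunks: List[str] = []
    current = ""
    for row in rows:
        if len(row) > max_len:
            # A single row alone exceeds the cap: flush, then character-split it.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(_char_split(row, max_len))
        elif len(current) + len(row) > max_len:
            # Adding this row would overflow: start a new chunk at the row boundary.
            chunks.append(current)
            current = row
        else:
            current += row
    if current:
        chunks.append(current)
    return chunks
```

Rows stay intact whenever they fit, so a reader of any chunk sees complete records; only an oversized row is ever cut mid-cell.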
- account for table wrapper overhead in row splitter budgets to prevent post-wrap overflows
- add recursive re-splitting for table chunks that still exceed target_max after wrapping
- debit newline separator tokens in no-anchor greedy packing to enforce target_max strictly
- add tests for separator token accounting and wrapper overhead budgeting
- fix missing newline at end of file to follow POSIX standard
- remove single-paragraph early return and recursive guard to allow character-level splitting of oversized single paragraphs
- re-measure joined content after separator tokens in tail absorption to prevent silent overflow
- disable chunk overlap in recursive character fallback to honor non-overlapping contract
- add regression tests for merge boundary checks, single-paragraph split, and fallback overlap behavior
danielaskdd and others added 28 commits May 20, 2026 13:55
…TNxo8

feat(opensearch): add basename and content_hash lookups for doc status
…tion

- eliminate unnecessary `:-/` fallback in redis uri path capture
- ensure exact path preservation from original uri during local service normalization
… setup scripts

- add /app/data/prompts directory creation in dockerfile and dockerfile.lite
- add PROMPT_DIR environment variable and volume mounts in all compose files
- update setup scripts to support PROMPT_DIR configuration and idempotent mount injection
- fix redis test default uri to remove trailing slash
- consolidate verbose log strings in parse_mineru and parse_docling to reduce noise
- shorten analyze_multimodal opt-in missing and backfill log lines for clarity
- remove redundant file_path references from completion and cache hit logs
- update chinese documentation to match simplified log format
… methods

- remove default implementations of get_doc_by_file_basename and get_doc_by_content_hash
- add `@abstractmethod` decorator to enforce implementation in subclasses
- clean up unused asdict import from dataclasses module
- simplify docstrings to reflect abstract nature of methods
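As a sketch of what making these two lookups abstract implies for storage backends: the class names, the `async` signatures, and the record field names below are assumptions for illustration; only the two method names come from the commit.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, Optional


class DocStatusLookup(ABC):
    """Hypothetical base class: subclasses must implement both lookups."""

    @abstractmethod
    async def get_doc_by_file_basename(self, basename: str) -> Optional[Dict[str, Any]]:
        """Return the doc status record for a file basename, or None."""

    @abstractmethod
    async def get_doc_by_content_hash(self, content_hash: str) -> Optional[Dict[str, Any]]:
        """Return the doc status record for a content hash, or None."""


class InMemoryDocStatus(DocStatusLookup):
    """Toy backend used only to show a complete implementation."""

    def __init__(self, docs: Iterable[Dict[str, Any]]) -> None:
        self._docs = list(docs)

    async def get_doc_by_file_basename(self, basename: str) -> Optional[Dict[str, Any]]:
        return next((d for d in self._docs if d.get("file_basename") == basename), None)

    async def get_doc_by_content_hash(self, content_hash: str) -> Optional[Dict[str, Any]]:
        return next((d for d in self._docs if d.get("content_hash") == content_hash), None)


if __name__ == "__main__":
    store = InMemoryDocStatus([{"file_basename": "a.pdf", "content_hash": "h1"}])
    print(asyncio.run(store.get_doc_by_file_basename("a.pdf")))
```

With no default implementation, a backend that forgets either method fails at instantiation with a `TypeError` instead of silently returning a fallback result at query time.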
…ation

- correct the info log message format for empty equations sidecar in analyze_multimodal
- replace specific entity_type subdirectory with entire prompts directory
- update comment to reflect user customized prompt directory purpose
- disable default memgraph port exposure for improved security in template
- allow users to opt-in to port exposure via environment configuration if needed
- replace file_path with doc_id in chunking log messages for better traceability
- apply consistent logging format across all chunking strategies (P, R, V, F, legacy)
- change ignored path from entire prompts directory to specific entity_type subdirectory
- add documentation for user-defined prompts folder purpose
- clarify default behavior when ENTITY_EXTRACTION_USE_JSON is unset
- improve description of json output trade-offs with latency and reliability
…load feedback

Backend:
- /health derives pipeline_active = busy || scanning || destructive_busy || pending_enqueues > 0
- Also exposes pipeline_scanning / pipeline_destructive_busy / pipeline_pending_enqueues
- Closes the gap where the scan classification phase set only `scanning` and
  the pipeline-busy button stayed grey for 5~10s

Frontend:
- Add activity probe: exponential-backoff /health bursts at t=0/1/2/4/8/16s
  fired by scan_started and the first successful upload in a batch. Exits as
  soon as both pipelineActive=true AND the document list has caught up.
- Add refreshDocumentsThrottled(): wall-clock 2s minimum between any two
  /documents/paginated requests, with trailing-call coalescing.
- Scan/upload no longer rely on resetHealthCheckTimerDelayed + ad hoc fast
  polling windows; probe + active polling cover both paths.
- Polling stays at 5s while pipelineActive=true even if doc list hasn't
  surfaced new rows yet, so the 30s idle gap right after scan disappears.
- Stale trailing refresh is dropped via latestRefreshRequestVersionRef check
  so 2s-window page/filter/sort changes can't be overwritten by a captured
  old query.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
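The derived `pipeline_active` flag and the probe schedule described in this commit can be sketched as follows. The real probe lives in the frontend; Python is used here only to illustrate the logic, and the dataclass and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PipelineFlags:
    """Hypothetical container for the raw status fields named in the commit."""
    busy: bool = False
    scanning: bool = False
    destructive_busy: bool = False
    pending_enqueues: int = 0


def pipeline_active(flags: PipelineFlags) -> bool:
    """Active if any phase is running or any enqueue is still pending."""
    return bool(
        flags.busy
        or flags.scanning
        or flags.destructive_busy
        or flags.pending_enqueues > 0
    )


def probe_offsets(max_offset: float = 16.0) -> List[float]:
    """Offsets in seconds for the burst: t=0, then doubling from 1 up to max_offset."""
    offsets = [0.0]
    step = 1.0
    while step <= max_offset:
        offsets.append(step)
        step *= 2
    return offsets
```

Including `scanning` in the derivation is the part that closes the gap described above: during scan classification only that flag is set, so a check on `busy` alone would report the pipeline as idle.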
feat(pipeline-status): probe + throttled refresh for prompt scan/upload feedback
Add English versions of the three Chinese-only docs:
- FileProcessingPipeline.md
- LightRAGSidecarFormat.md
- ParserDebugCLI.md

https://claude.ai/code/session_01PEf2XkGrpo79D43GVPWn3G
- remove legacy upgrade appendix about deprecated global multimodal switch
- keep both chinese and english documentation in sync
- add RagAnything merge announcement with MinerU/Docling support
- document four new text chunking strategies
- add role-specific LLM configuration details
P (paragraph_semantic) chunking now uses DEFAULT_CHUNK_P_SIZE (2000) when
CHUNK_P_SIZE env is unset, instead of silently inheriting the global
CHUNK_SIZE / LightRAG(chunk_token_size=...). Paragraph-semantic merging
needs more headroom than the global default to keep related paragraphs
together; inheriting the smaller global ceiling defeats the strategy's
purpose.

Precedence (high → low):
  caller-supplied paragraph_semantic.chunk_token_size
  > CHUNK_P_SIZE env
  > DEFAULT_CHUNK_P_SIZE (2000)

The backfill lives in slim_chunk_options() — the single chokepoint
shared by both enqueue paths (resolve_chunk_options + caller-supplied
chunk_options=). _apply_chunk_size_overlay() carries a mirror backfill
so direct addon_params introspection sees the resolved value too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
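A minimal sketch of the backfill described above, assuming `chunk_options` is a nested dict keyed by strategy name. The function name and dict shape are assumptions; this is not the body of `slim_chunk_options()`.

```python
from typing import Any, Dict, Mapping

DEFAULT_CHUNK_P_SIZE = 2000  # default stated in the commit message


def backfill_p_chunk_size(
    chunk_options: Mapping[str, Any],
    env: Mapping[str, str],
) -> Dict[str, Any]:
    """Return a copy of chunk_options with paragraph_semantic.chunk_token_size resolved.

    Precedence: caller-supplied value > CHUNK_P_SIZE env > DEFAULT_CHUNK_P_SIZE.
    The global CHUNK_SIZE is deliberately never consulted here.
    """
    resolved = dict(chunk_options)
    p_opts = dict(resolved.get("paragraph_semantic") or {})
    if p_opts.get("chunk_token_size") is None:
        raw = env.get("CHUNK_P_SIZE", "").strip()
        p_opts["chunk_token_size"] = int(raw) if raw else DEFAULT_CHUNK_P_SIZE
    resolved["paragraph_semantic"] = p_opts
    return resolved
```

Copying both the outer and the `paragraph_semantic` dict keeps the caller's input untouched, which matters if the same options object is reused across enqueue calls.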
…-default

feat(chunker): give P strategy a dedicated default chunk_token_size
…d modalities

- remove idempotent skip logic for existing llm_analyze_result entries
- overwrite prior success/skipped/failure results on each run for enabled modalities
- allow retry after fixing vlm/extract configuration without manual sidecar cleanup
- rely on llm analysis cache to avoid redundant provider calls when inputs unchanged
- update docs and tests to reflect new non-idempotent overwrite behavior
… response logging

- add `raise_for_status_with_detail` and `response_error_detail` helpers to `_common.py`
- replace ad-hoc status checks in docling and mineru clients with unified helper
- include compact response body snippets in error messages for faster debugging
- add test coverage for HTTP error preservation and non-2xx handling in docling and mineru
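The idea behind these helpers can be sketched without committing to a particular HTTP client. The signatures below are assumptions (the real helpers in `_common.py` presumably take a response object), and `ParserHTTPError` is a hypothetical exception type.

```python
class ParserHTTPError(RuntimeError):
    """Hypothetical error type carrying a compact snippet of the response body."""


def response_error_detail(body: str, limit: int = 200) -> str:
    """Collapse whitespace and truncate the body so it fits in one log line."""
    compact = " ".join(body.split())
    return compact if len(compact) <= limit else compact[:limit] + "..."


def raise_for_status_with_detail(status_code: int, body: str, url: str = "") -> None:
    """Raise on any non-2xx status, including a snippet of the response body."""
    if 200 <= status_code < 300:
        return
    raise ParserHTTPError(
        f"HTTP {status_code} from {url}: {response_error_detail(body)}"
    )
```

Carrying the body snippet in the exception message means the parser service's own explanation shows up in the caller's log, rather than a bare status code.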
…format

- add lightrag_load_errors collection to track blocks.jsonl read failures
- skip documents with unreadable blocks instead of creating false "{{LRdoc}}" entries
- flush failed stubs via apipeline_enqueue_error_documents inside critical section
- return track_id on failure-only batches instead of None to prevent silent archival
- expose file_size and original_error in failure records for better debugging
- add reference to FileProcessingPipeline.md documentation for parser setup
- change example LIGHTRAG_PARSER from commented to active with new default pattern
- update parser pattern to use native-teP and legacy-R fallbacks
…d paragraph semantic chunking documentation

- update both zh and en quick start sections with clearer legacy, recommended and multimodal scenarios
- replace mineru-centric examples with native-teP and legacy-R combinations
- add new comprehensive ParagraphSemanticChunking.md with full P strategy documentation
- remove outdated native-only docx examples and docling references
- align zh docs with en structure and terminology
@danielaskdd danielaskdd merged commit b62c260 into main May 21, 2026
4 of 5 checks passed
@danielaskdd danielaskdd deleted the dev branch May 25, 2026 04:40
3 participants